Springboard Time Series - 'Cowboy Cigarettes' Case Study - Tier 3

Brief

You're working in the US federal government as a data scientist in the Health and Environment department. You've been tasked with determining whether sales for the oldest and most powerful producer of cigarettes in the country are increasing or declining.

Cowboy Cigarettes (TM, est. 1890) is the US's longest-running cigarette manufacturer. Like many cigarette companies, however, they haven't always been that public about their sales and marketing data. The available post-war historical data runs from 1949, when they resumed production, to 1960; records then stop before picking up again in 1970. Your job is to use the 1949-1960 data to predict whether the manufacturer's cigarette sales actually increased, decreased, or stayed the same. You need to make a probable reconstruction of the manufacturer's sales record - predicting the future from the perspective of the past - to contribute to a full report on US public health in relation to major cigarette companies.

The results of your analysis will be used as part of a major report relating public health and local economics, and will be combined with other studies executed by your colleagues to provide important government advice.


As ever, this notebook is tiered, meaning you can elect the tier that is right for your confidence and skill level. There are 3 tiers, with tier 1 being the easiest and tier 3 being the hardest.

1. Sourcing and loading

  • Load relevant libraries
  • Load the data
  • Explore the data

2. Cleaning, transforming and visualizing

  • Dropping unwanted columns
  • Nomenclature
  • Type conversions
  • Making a predictor variable y
  • Getting summary statistics for y
  • Plotting y

3. Modelling

  • Decomposition
    • Trend
    • Seasonality
    • Noise
  • Testing for stationarity with KPSS
  • Making the data stationary
  • The ARIMA Model
    • Make a function to find the MSE of a single ARIMA model
    • Make a function to evaluate the different ARIMA models with different p, d, and q values
  • Visualize the results
  • Application: Forecasting

4. Evaluating and concluding

  • What is our conclusion?
  • Next steps

0. Preliminaries

Time series data is just any data displaying how a single variable changes over time. It comes as a collection of metrics typically taken at regular intervals. Common examples of time series data include weekly sales data and daily stock prices. You can also easily acquire time series data from Google Trends, which shows you how popular certain search terms are, measured in number of Google searches.
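
For instance, here is a minimal, made-up sketch of what such data looks like in pandas (the values and dates are purely illustrative):

import pandas as pd

# A hypothetical monthly series: one measured value per regular interval,
# indexed by date
toy_sales = pd.Series(
    [120, 135, 128, 150],
    index=pd.date_range('2020-01-01', periods=4, freq='MS'),
    name='sales'
)
print(toy_sales)

The essential ingredients are a single measured variable and a regular, time-based index - exactly the shape we'll coax our cigarette data into below.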

1. Sourcing and Loading

1a. Load relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline

1b. Load the data

Call the variable cigData.

In [2]:
cigData = pd.read_csv('CowboyCigsData.csv',index_col=0)

1c. Explore the data

We now need to check whether the data conduces to a time series style analysis.

In [3]:
cigData.head()
Out[3]:
      Time  #CigSales
0  1949-01    1000112
1  1949-02    1000118
2  1949-03    1000132
3  1949-04    1000129
4  1949-05    1000121

Over a million cigarettes sold in the month of January 1949. This certainly is a popular cigarette brand.

Check out the columns of the data. How many columns are there?

In [4]:
cigData.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       144 non-null    object
 1   #CigSales  144 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 3.4+ KB

There are 2 columns

Let's check out the data types of our columns.

In [5]:
cigData.dtypes
Out[5]:
Time         object
#CigSales     int64
dtype: object

Check whether there are any null values.

In [6]:
cigData.isnull().sum()
Out[6]:
Time         0
#CigSales    0
dtype: int64

There are no null values in either column.

2. Cleaning, transforming and visualizing

2a. Dropping unwanted columns

We would normally need to cut the Unnamed: 0 column here. However, because we loaded the CSV with index_col=0, that column was already consumed as the index, so there is nothing to drop.

2b. Nomenclature

We can see that the Time column actually has the granularity of months. Change the name of that column to Month.

In [7]:
cigData = cigData.rename(columns={'Time': 'Month'})

Call a head() to check this has worked.

In [8]:
cigData.head()
Out[8]:
     Month  #CigSales
0  1949-01    1000112
1  1949-02    1000118
2  1949-03    1000132
3  1949-04    1000129
4  1949-05    1000121

2c. Type conversions

Now, doing time series analysis on a pandas DataFrame is overkill, and actually counter-productive. It's much easier to carry out this type of analysis if we convert our data to a Series first.

Notice that the Month field was an object. Let's type convert the Month column to a Python datetime, before making that the index.

In [9]:
cigData['Month'] = pd.to_datetime(cigData['Month'])
cigData.set_index('Month', inplace = True)

Perfect!

In [10]:
cigData.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   #CigSales  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB

2d. Making a predictor variable y

The data is now indexed by date, as time series data ought to be.

Since we want to predict the number of cigarette sales at Cowboy Cigarettes, and y is conventionally used for the variable we want to predict, let's create a new variable called y and assign it the indexed #CigSales column.

In [11]:
y = cigData['#CigSales']

Check the type of our new variable.

In [12]:
type(y)
Out[12]:
pandas.core.series.Series

2e. Getting summary statistics for y

Get the summary statistics of our data here.

In [13]:
y.describe()
Out[13]:
count    1.440000e+02
mean     1.000280e+06
std      1.199663e+02
min      1.000104e+06
25%      1.000180e+06
50%      1.000266e+06
75%      1.000360e+06
max      1.000622e+06
Name: #CigSales, dtype: float64

2f. Plotting y

Try visualizing the data. A simple matplotlib plot should do the trick.

In [14]:
y.plot();

3. Modelling

3a. Decomposition

What do you notice from the plot? Take at least 2 minutes to examine the plot, and write down everything you observe.

All done?

We can see that, generally, there is a trend upwards in cigarette sales at Cowboy Cigarettes. But there are also some striking - and perhaps unexpected - seasonal fluctuations. These seasonal fluctuations come in a repeated pattern. Work out when these seasonal fluctuations are happening, and take 2 minutes to hypothesize on their cause here.

What does it mean to decompose time series data? It means breaking that data into 3 components:

  1. Trend: The overall direction that the data is travelling in (like upwards or downwards)
  2. Seasonality: Cyclical patterns in the data
  3. Noise: The random variation in the data

We can treat these components differently, depending on the question and what's appropriate in the context. They can either be added together in an additive model, or multiplied together in a multiplicative model.
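
In symbols, writing y(t) for the observed value at time t, T(t) for the trend, S(t) for the seasonal component and e(t) for the noise:

  • Additive model: y(t) = T(t) + S(t) + e(t)
  • Multiplicative model: y(t) = T(t) × S(t) × e(t)

One useful consequence, which we'll lean on later, is that taking logs turns a multiplicative series into an additive one, since log(T × S × e) = log(T) + log(S) + log(e).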

Make a coffee, take 5 minutes and read this article and think about whether our data would conduce to an additive or multiplicative model here. Write your conclusion down just here:


All done? Well, just on the basis of the plot above, it seems our Cowboy Cigarettes data is actually multiplicative.

That's because, as time progresses, the general trend seems to be increasing at a rate that's also increasing. We also see that the seasonal fluctuations (the peaks and troughs) get bigger and bigger as time progresses.

Now on the other hand, if the data were simply additive, we would expect the general trend to increase steadily, at a constant rate, and the seasonal ups and downs to stay roughly the same size over time.

Happily, we can use statsmodels' seasonal_decompose() function to quantify the component parts described above in our data.

In [15]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Plot the original data, the trend, the seasonality, and the residuals 
result = seasonal_decompose(y, model='multiplicative')  # model='mul' also works
result.plot();

3b. Testing for stationarity with KPSS

As you know, when doing time series analysis we always have to check for stationarity. Imprecisely, a time series dataset is stationary just if its statistical features don't change over time. A little more precisely, a stationary time series dataset will have a constant mean and variance, and an autocovariance that depends only on the lag between observations rather than on time itself.

There are many ways to test for stationarity, but one of the most common is the KPSS test. The Null hypothesis of this test is that the time series data in question is stationary; hence, if the p-value is less than the significance level (typically 0.05, but we decide) then we reject the Null and infer that the data is not stationary.

In [16]:
from statsmodels.tsa.stattools import kpss
kpss(y)
(This call raises a FutureWarning about the default nlags behaviour, and an InterpolationWarning: the test statistic is outside the range of the p-value look-up table, so the actual p-value is smaller than the value returned.)
Out[16]:
(1.0521750110138661,
 0.01,
 14,
 {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739})

Since the reported p-value (0.01, and in fact smaller) is less than 0.05, we reject the null hypothesis and conclude that our data is non-stationary.

But our data needs to be stationary! So we need to do some transforming.

3c. Making the data stationary

Let's recall what it looks like.

In [17]:
y.plot();

In our plot, we can see that both the mean and the variance increase as time progresses. At the moment, our data has neither a constant mean, nor a constant variance (the covariance, however, seems constant).

One often used way of getting rid of changing variance is to take the natural log of all the values in our dataset. Let's do this now.

In [18]:
# Declare a variable called y_log
y_log = np.log(y)

When you plot this, you can see how the variance in our data now remains constant over time.

In [19]:
y_log.plot();

We now have a constant variance, but we also need a constant mean.

We can do this by differencing our data. We difference a time series dataset when we create a new series made up of the differences between consecutive values of the existing one.

Python is powerful, and pandas gives us the diff() method to do exactly this. Note that the first value of the differenced series is NaN (there's nothing before it to subtract), which is why we drop it with dropna() below.
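
If you want to see the effect before running the test, here's a quick optional sketch (just a visual check, using the y_log variable from above):

# Optional: difference the logged series and plot it to eyeball the stabilised mean
y_log_diff = y_log.diff().dropna()  # drop the leading NaN created by differencing
y_log_diff.plot();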

In [20]:
kpss(y_log.diff().dropna())
(Again a FutureWarning about nlags, plus an InterpolationWarning: this time the test statistic is outside the look-up table in the other direction, so the actual p-value is greater than the value returned.)
Out[20]:
(0.05301079859857677,
 0.1,
 14,
 {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739})

Our p-value is now greater than 0.05, so we fail to reject the null hypothesis: the log-differenced data can be treated as stationary.

3d. The ARIMA model

Recall that ARIMA models are based around the idea that it's possible to predict the next value in a time series using information about the most recent data points. They also assume there will be some randomness in the data that can never be predicted.
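
More concretely, an ARIMA(p, d, q) model has three ingredients: an autoregressive part of order p (the next value depends on the previous p values), a differencing order d (the series is differenced d times to make it stationary), and a moving-average part of order q (the next value also depends on the previous q forecast errors). Roughly, for the d-times-differenced series y'(t):

  y'(t) = c + φ1·y'(t-1) + ... + φp·y'(t-p) + θ1·e(t-1) + ... + θq·e(t-q) + e(t)

where e(t) is the unpredictable noise mentioned above.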

We can find some good parameters for our model using the sklearn and statsmodels libraries, and in particular mean_squared_error and ARIMA.

In [21]:
# Import mean_squared_error and ARIMA
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima_model import ARIMA
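# Note: the arima_model module was removed in statsmodels 0.13; in newer versions the
# equivalent class lives at statsmodels.tsa.arima.model.ARIMA, with a slightly different
# API (e.g. fit() takes no disp argument). The code below assumes the older interface
# that produced the outputs shown in this notebook.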

3di. Make a function to find the MSE of a single ARIMA model

Things get intricate here. Don't worry if you can't do this yourself and need to drop down a Tier.

In [22]:
# Make a function called evaluate_arima_model to find the MSE of a single ARIMA model 
def evaluate_arima_model(data, arima_order):
    # Needs to be an integer because it is later used as an index.
    # Use int()
    split=int(len(data) * 0.8) 
    # Make train and test variables, with 'train, test'
    train, test = data[0:split], data[split:len(data)]
    past=[x for x in train]
    # make predictions
    predictions = list()
    # Step through the test set: compare each actual value with a one-step-ahead ARIMA forecast
    for i in range(len(test)):
        model = ARIMA(past, order=arima_order)
        model_fit = model.fit(disp=0)
        future = model_fit.forecast()[0]
        predictions.append(future)
        past.append(test[i])
    # calculate out of sample error
    error = mean_squared_error(test, predictions)
    # Return the error
    return error
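
Before running the full grid search in the next step, you could sanity-check the function on a single candidate order; this is just an optional illustration, using the y_log series defined earlier:

# Optional: score one candidate order on its own
single_mse = evaluate_arima_model(y_log, (1, 1, 1))
print('ARIMA(1, 1, 1) MSE: %.6f' % single_mse)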

3dii. Make a function to evaluate the different ARIMA models with different p, d, and q values

In [23]:
# Make a function to evaluate different ARIMA models with several different p, d, and q values.

def evaluate_models(dataset, p_values, d_values, q_values):
    best_score, best_cfg = float("inf"), None
    # Iterate through p_values
    for p in p_values:
        # Iterate through d_values
        for d in d_values:
            # Iterate through q_values
            for q in q_values:
                # p, d, q iterator variables in that order
                order = (p,d,q)
                try:
                    # Make a variable called mse for the Mean squared error
                    mse = evaluate_arima_model(dataset, order)
                    if mse < best_score:
                        best_score, best_cfg = mse, order
                    print('ARIMA%s MSE=%.3f' % (order,mse))
                except:
                    continue
    print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score))
In [24]:
# Now, we choose a couple of values to try for each parameter.
p_values = [x for x in range(0, 3)]
d_values = [x for x in range(0, 3)]
q_values = [x for x in range(0, 3)]
In [25]:
# Finally, we can find the optimum ARIMA model for our data.
# Nb. this can take a while...!

import warnings
warnings.filterwarnings("ignore")
evaluate_models(y_log, p_values, d_values, q_values)
ARIMA(0, 0, 0) MSE=0.000
ARIMA(0, 0, 1) MSE=0.000
ARIMA(0, 1, 0) MSE=0.000
ARIMA(0, 1, 1) MSE=0.000
ARIMA(0, 1, 2) MSE=0.000
ARIMA(0, 2, 0) MSE=0.000
ARIMA(0, 2, 1) MSE=0.000
ARIMA(0, 2, 2) MSE=0.000
ARIMA(1, 0, 0) MSE=0.000
ARIMA(1, 0, 1) MSE=0.000
ARIMA(1, 0, 2) MSE=0.000
ARIMA(1, 1, 0) MSE=0.000
ARIMA(1, 1, 1) MSE=0.000
ARIMA(1, 2, 0) MSE=0.000
ARIMA(2, 0, 0) MSE=0.000
ARIMA(2, 0, 1) MSE=0.000
ARIMA(2, 1, 0) MSE=0.000
ARIMA(2, 1, 1) MSE=0.000
ARIMA(2, 1, 2) MSE=0.000
ARIMA(2, 2, 0) MSE=0.000
Best ARIMA(2, 1, 1) MSE=0.000

So the best p, d, and q parameters for our ARIMA model are 2, 1, and 1 respectively. Now that we know this, we can build the model.

In [26]:
p=2
d=1
q=1
model = ARIMA(y_log, order=(p,d,q))
model_fit = model.fit()
forecast = model_fit.forecast(24)

(The fit prints verbose L-BFGS-B solver output: the problem is unconstrained, the optimiser converges after 2 iterations at F = -8.833 with the message CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH, and warns that more than 10 function and gradient evaluations were made in the last line search.)

We can take a look at a summary of the model this library has built around our data.

In [27]:
# Call summary() on model_fit
model_fit.summary()
Out[27]:
ARIMA Model Results
Dep. Variable:    D.#CigSales       No. Observations:      143
Model:            ARIMA(2, 1, 1)    Log Likelihood         1263.116
Method:           css-mle           S.D. of innovations    0.000
Date:             Sun, 31 Oct 2021  AIC                    -2516.233
Time:             14:14:05          BIC                    -2501.418
Sample:           02-01-1949        HQIC                   -2510.213
                  - 12-01-1960

                        coef    std err        z    P>|z|    [0.025    0.975]
const              2.624e-06   5.06e-07    5.184    0.000  1.63e-06  3.62e-06
ar.L1.D.#CigSales     0.4681      0.156    3.003    0.003     0.163     0.774
ar.L2.D.#CigSales    -0.2640      0.109   -2.413    0.016    -0.478    -0.050
ma.L1.D.#CigSales    -0.8693        nan      nan      nan       nan       nan

Roots
            Real    Imaginary    Modulus    Frequency
AR.1      0.8866     -1.7326j     1.9463      -0.1747
AR.2      0.8866     +1.7326j     1.9463       0.1747
MA.1      1.1504     +0.0000j     1.1504       0.0000

3e. Visualize the results

Visualize our differenced, log-transformed data plotted against the model's in-sample predictions.

In [28]:
# Call figure() and plot() on the plt
plt.figure(figsize=(15,10))
plt.plot(y_log.diff())
plt.plot(model_fit.predict(), color = 'red')
Out[28]:
[<matplotlib.lines.Line2D at 0x7fc64401d910>]

3f. Application: Forecasting

We've done well: our model fits pretty closely to our existing data. Let's now use it to forecast what's likely to occur in future.

In [29]:
# Declare a variable called forecast_period with the amount of months to forecast, and
# create a range of future dates that is the length of the periods you've chosen to forecast
forecast_period = 24
date_range = pd.date_range(y_log.index[-1], periods = forecast_period, 
              freq='MS').strftime("%Y-%m-%d").tolist()

# Convert that range into a dataframe that includes your predictions
future_months = pd.DataFrame(date_range, columns = ['Month'])
future_months['Month'] = pd.to_datetime(future_months['Month'])
future_months.set_index('Month', inplace = True)
future_months['Prediction'] = forecast[0]

In [30]:
# Plot your future predictions
plt.figure(figsize=(15,10))
plt.plot(y_log)
plt.plot(y_log['Nov 1960'].append(future_months['Prediction']))
plt.show()
In [31]:
# Now plot the original variable y 
# Use the same functions as before
plt.figure(figsize=(15,10))
plt.plot(y)
plt.plot(np.exp(y_log['Nov 1960'].append(future_months['Prediction'])))
plt.show()

4. Evaluating and Concluding

Our model captures the centre of an upward trend that's increasing at a remarkable rate. Cowboy Cigarettes sells more cigarettes in the summer, perhaps due to the good weather, disposable income and time off that people enjoy, and the least in the winter, when people might be spending less and enjoying less free time outdoors.

Remarkably, our ARIMA model made predictions using just one variable. We can only speculate, however, on the causes of the behaviour predicted by our model. We should also take heed that spikes in data, due to sudden unusual circumstances like wars, are not handled well by ARIMA; and the outbreak of the Vietnam War in the 1960s would likely cause our model some distress.

We could suggest to our employers that, if they are interested in discovering the causes of the cigarette sales trajectory, they execute a regression analysis in addition to the time series one.